Abstract: We establish bounds on the KL divergence between
two multivariate Gaussian distributions in terms of the Hamming 
distance between the edge sets of the corresponding graphical 
models. We show that the KL divergence is bounded below by 
a constant when the graphs differ by at least one edge; this is 
essentially the tightest possible bound, since classes of graphs exist 
for which the edge discrepancy increases but the KL divergence 
remains bounded above by a constant. As a natural corollary to 
our KL lower bound, we also establish a sample size requirement 
for correct model selection via maximum likelihood estimation. 
Our results rigorize the notion that it is essential to estimate the 
edge structure of a Gaussian graphical model accurately in order 
to approximate the true distribution to close precision. 

I. Introduction 

Graphical models have enjoyed increasing popularity in
a wide variety of scientific disciplines, including social networks
[1], computer vision [2], neuroscience [3], molecular
biology [4], and clinical medicine [5]. Recent years have
also seen substantial theoretical advances regarding graphical
models in high-dimensional statistics (see, e.g., [6], [7], [8],
[9], [10]). Broadly speaking, the goal of statistical estimation
in graphical models is to (a) estimate the edge structure of
the graph, which encodes conditional independence relationships
between variables; and (b) infer the parameters of the
distribution. The two goals are often treated separately: upon
determining the edges of the graph, the parameters are fit with
respect to a reduced search space. This reduces the dimensionality
of the subsequent parameter estimation problem, which
may be advantageous in high-dimensional problems where
the underlying graph is sparse. We consider a setting where
data are collected in the form of joint observations; in high-dimensional
scenarios, the number of nodes is assumed to be
much larger than the number of observations.

However, when parameter estimation is conducted in the 
wake of edge estimation, inaccuracies in the estimated graph 
structure will propagate to the parameter estimation step. 
Although superfluous edges may subsequently be removed by 
setting the corresponding parameters to zero, missing edges 
in the estimated graph may lead to model misspecification. 
Consequently, the estimated distribution may be far from 
the actual distribution. Various authors (e.g., [6], [11], [12], 
[9], [13]) have established sufficient conditions for specific 
estimation procedures that guarantee correct edge recovery, 
albeit under fairly stringent conditions that are more restrictive
than the conditions needed for $\ell_1$- and $\ell_2$-consistency.

In this paper, we explore the following question: If the
edge structure of the graph is estimated incorrectly, how large 
is the deviation between the true distribution and the closest 
fit with respect to the errant graphical model? We restrict our 
attention to Gaussian graphical models. Our main contribution 
is to establish a constant lower bound on the KL divergence 
between the true distribution and the closest approximation 
when the graphs differ by even a single edge. This should be 
viewed in conjunction with the work of Zhou et al. [14], who 
establish upper bounds on the edge discrepancy for a certain 
graph estimation procedure. Indeed, our result stipulates the 
need to identify the edge structure of the true graphical model 
with complete accuracy in order to approximate the underlying 
distribution to arbitrary precision. 

Our results have interesting connections to other lines of 
previous work. Theorem 1 below relates the KL divergence 
between two Gaussian distributions with different graphical 
models to the conditional mutual information between the pair 
of variables corresponding to the edge discrepancy. Bounds on 
a similar conditional mutual information expression are used 
to derive sample complexity results for a graphical model 
estimation procedure proposed by Anandkumar et al. [13], 
[15], and indeed, our Lemma 1 is similar to a proposition 
proved in that paper. However, rather than focusing on requirements
for statistical consistency of a particular graphical model
estimation algorithm, we leverage this lemma to lower-bound 
the KL divergence between distributions in terms of entries of 
the inverse covariance matrix. In a recent paper, Bresler [16] 
provides lower bounds on the conditional mutual information 
between pairs of variables in an Ising model, although it is 
unclear whether that result could be used to derive a similar 
constant lower bound on the KL divergence between Ising 
models with differing graphical structure. 

The remainder of the paper is organized as follows: In 
Section II, we provide a precise mathematical formulation 
of the problem under consideration and introduce relevant 
notation. Section III contains statements of our main theorems, 
where we first lower-bound the KL divergence in terms of the 
conditional mutual information and then in terms of a constant 
parameter defined according to entries of the true inverse 
covariance matrix. We then discuss an easy consequence of 
the KL bound regarding the sample complexity of a likelihood- 
based approach for model selection. We close in Section IV 
with some extensions of our KL lower bound and an example 
showing that the KL separation does not necessarily grow in a
meaningful way with the Hamming distance between the edge
set of the true graph and a candidate estimator. Detailed proofs
may be found in the arXiv version of the manuscript [17].

Notation: For functions $f(n)$ and $g(n)$, we write $f(n) \lesssim g(n)$
to indicate that $f(n) \le c g(n)$ for some universal constant
$c \in (0, \infty)$, and similarly, we write $f(n) \gtrsim g(n)$ when
$f(n) \ge c' g(n)$ for some universal constant $c' \in (0, \infty)$. We
use the symbol $\perp\!\!\!\perp$ to indicate independence. For a matrix
$M$, we write $|||M|||_F$ to denote the Frobenius norm, and let
$\lambda_{\max}(M)$ denote the maximum eigenvalue of $M$. We write
$M(i,j)$ to denote the $(i,j)$th entry of $M$ and
$\mathrm{supp}(M) := \{(i,j) : i < j \text{ and } M(i,j) \neq 0\}$ to denote the (ordered)
support of $M$. Finally, $\mathrm{vec}(M)$ denotes the vectorized version
of the matrix and $|M| = \det(M)$ denotes the determinant.

II. Background and problem setup 

Consider a zero-mean multivariate Gaussian distribution
$q_\Theta := N(0, \Theta^{-1})$ with inverse covariance matrix $\Theta \in \mathbb{R}^{p \times p}$.
Recall that the Gaussian graphical model corresponding
to the distribution $q_\Theta$ is given by the undirected graph
$G(\Theta) = (V, E)$, where $V = \{1, \dots, p\}$ and $E = \mathrm{supp}(\Theta)$ is
the support of the matrix $\Theta$. This is a special case of the well-developed
theory on undirected graphical models, also known
as Markov random fields, where nodes represent individual
variables in a joint distribution $X = (X_1, \dots, X_p)$, and missing
edges represent conditional independence relationships
between subsets of variables. In particular, $(i,j) \notin E$ implies
that $X_i \perp\!\!\!\perp X_j \mid X_{\{i,j\}^c}$, where we write $X_{\{i,j\}^c}$ to denote the
collection of variables $\{X_1, \dots, X_p\} \setminus \{X_i, X_j\}$. For a more
detailed exposition on graphical models, see Lauritzen [18] or
Koller and Friedman [19] and the references cited therein.
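The correspondence between an inverse covariance matrix and its edge set can be made concrete with a minimal sketch. The function names `support` and `hamming_distance` and the two 3-node matrices below are hypothetical and chosen only for illustration; the sketch follows the (ordered) support convention $i < j$ from the notation paragraph and uses 1-indexed node labels.

```python
def support(theta, tol=0.0):
    """Return supp(Theta) = {(i, j) : i < j and Theta(i, j) != 0}, 1-indexed."""
    p = len(theta)
    return {(i + 1, j + 1)
            for i in range(p) for j in range(i + 1, p)
            if abs(theta[i][j]) > tol}


def hamming_distance(edges_a, edges_b):
    """Number of edges that appear in exactly one of the two edge sets."""
    return len(edges_a ^ edges_b)


# Hypothetical precision matrices (illustrative values only).
theta_1 = [[2.0, 0.5, 0.0],
           [0.5, 2.0, 0.5],
           [0.0, 0.5, 2.0]]   # chain graph 1 - 2 - 3
theta_2 = [[2.0, 0.0, 0.0],
           [0.0, 2.0, 0.5],
           [0.0, 0.5, 2.0]]   # same graph with edge (1, 2) removed

E1, E2 = support(theta_1), support(theta_2)
print(E1, E2, hamming_distance(E1, E2))
```

The `tol` argument is an added convenience for numerically estimated matrices, where exact zeros are rare; the definition in the text corresponds to `tol=0.0`.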

We now consider a pair of $p$-dimensional multivariate
Gaussian distributions $q_1 = q_{\Theta_1}$ and $q_2 = q_{\Theta_2}$. Our main
goal in this paper is to quantify the distance between $q_1$ and
$q_2$ in terms of the discrepancy between $G_1 = G(\Theta_1)$ and
$G_2 = G(\Theta_2)$. The distance between $q_1$ and $q_2$ is measured
via the Kullback-Leibler divergence between $q_1$ and $q_2$:

$$KL(q_1 \| q_2) = \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x)}{q_2(x)} \, dx.$$

For a fixed distribution $q_1$ defined over $G_1$, we wish to
find the infimum $\inf_{q_2} KL(q_1 \| q_2)$, where $q_2$ ranges over all
distributions defined over $G_2$. Note that if $G_1$ is a subgraph
of $G_2$, the value of this infimum may approach 0 if we let
$\Theta_2(i,j) \to 0$ for $(i,j) \in E_2 \setminus E_1$. Hence, we insist that there
is at least one edge $(i,j) \in E_1$ such that $(i,j) \notin E_2$.
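For zero-mean Gaussians, the integral above has the standard closed form $KL(q_1 \| q_2) = \frac{1}{2}\left[\mathrm{tr}(\Theta_2 \Theta_1^{-1}) - p + \log\det\Theta_1 - \log\det\Theta_2\right]$. The sketch below evaluates this expression in plain Python; the helper `inverse_and_logdet` (Gauss-Jordan elimination) and the example matrices are illustrative assumptions, not part of the paper.

```python
import math


def inverse_and_logdet(m):
    """Gauss-Jordan elimination with partial pivoting.

    Returns (inverse, log|det|) of a nonsingular square matrix. For a
    positive definite input, log|det| equals log det.
    """
    p = len(m)
    a = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(p)]
         for i, row in enumerate(m)]
    logdet = 0.0
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        pivot = a[col][col]
        logdet += math.log(abs(pivot))
        a[col] = [v / pivot for v in a[col]]
        for r in range(p):
            if r != col:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [row[p:] for row in a], logdet


def kl_gaussian(theta_1, theta_2):
    """KL(q_1 || q_2) for q_k = N(0, theta_k^{-1})."""
    p = len(theta_1)
    sigma_1, logdet_1 = inverse_and_logdet(theta_1)
    _, logdet_2 = inverse_and_logdet(theta_2)
    trace = sum(theta_2[i][j] * sigma_1[j][i]
                for i in range(p) for j in range(p))
    return 0.5 * (trace - p + logdet_1 - logdet_2)


# Hypothetical example: one edge in theta_a, none in theta_b.
theta_a = [[2.0, 1.0], [1.0, 2.0]]
theta_b = [[2.0, 0.0], [0.0, 2.0]]
kl_ab = kl_gaussian(theta_a, theta_b)
print(kl_ab)
```

Gauss-Jordan elimination is used only to keep the sketch free of external dependencies; for larger matrices a Cholesky-based routine from a numerical library would be the more usual choice.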

Some of our results will be stated in terms of particular
classes of positive definite matrices. Let

$$\Omega_\infty(\alpha, h) := \{\Theta \succ 0 : \Theta(i,i) \le h, \ \forall i; \text{ and } |\Theta(i,j)| \ge \alpha, \ \forall (i,j) \text{ s.t. } \Theta(i,j) \neq 0\},$$

and

$$\Omega_F(\gamma) := \{\Theta \succ 0 : |||\Theta|||_F \le \gamma\},$$

where $\Omega_\infty$ imposes bounds on individual entries and $\Omega_F$
imposes a uniform bound on the Frobenius norm. Note that a
similar class to $\Omega_\infty$, with an additional upper bound on the off-diagonal
entries, was analyzed by previous authors for Ising
models [10], [16], and a heuristic justification for entrywise
restrictions on the inverse covariance class vis-a-vis identifiability
may be found in Santhanam and Wainwright [10].
As explained in the remark following Corollary 1 below, the
ratio $\alpha/h$ may be viewed as a surrogate for the minimum signal
strength of a multivariate Gaussian distribution with inverse
covariance matrix $\Theta$. Furthermore, the fact that $\Theta$ is positive
semidefinite implies that $\Theta(i,j) \le h$, for $i \neq j$.
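Membership in the two matrix classes can be checked directly from their definitions. The following is a minimal sketch under two assumptions that are not stated in the text: positive definiteness is tested by attempting a Cholesky factorization, and the lower bound $\alpha$ is applied to nonzero off-diagonal entries only. All function names and the test matrix are hypothetical.

```python
import math


def is_positive_definite(theta):
    """A Cholesky factorization succeeds iff the symmetric matrix is positive definite."""
    p = len(theta)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = theta[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def in_omega_inf(theta, alpha, h):
    """Entrywise class: diagonal at most h, nonzero off-diagonals at least alpha in magnitude."""
    p = len(theta)
    diag_ok = all(theta[i][i] <= h for i in range(p))
    offdiag_ok = all(abs(theta[i][j]) >= alpha
                     for i in range(p) for j in range(p)
                     if i != j and theta[i][j] != 0)
    return is_positive_definite(theta) and diag_ok and offdiag_ok


def in_omega_f(theta, gamma):
    """Frobenius class: positive definite with Frobenius norm at most gamma."""
    frob = math.sqrt(sum(v * v for row in theta for v in row))
    return is_positive_definite(theta) and frob <= gamma


# Hypothetical 3-node chain (illustrative values only).
theta = [[2.0, 0.5, 0.0],
         [0.5, 2.0, 0.5],
         [0.0, 0.5, 2.0]]
print(in_omega_inf(theta, alpha=0.5, h=2.0), in_omega_f(theta, gamma=4.0))
```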

III. Main results and consequences 

We now present our core theoretical results. We begin with
the following theorem, which quantifies the KL divergence
between an arbitrary distribution $q_1$ and a distribution $q_2$ taken
from a class with at least one edge missing.

Theorem 1. Let $X = (X_1, \dots, X_p)$ be drawn from a multivariate
distribution with density $q_1$. Then

$$\min_{q_2 :\, X_1 \perp\!\!\!\perp X_2 \mid X_{\{1,2\}^c}} KL(q_1 \| q_2) \ge I(X_1; X_2 \mid X_3, \dots, X_p), \qquad (1)$$

where $I(X_1; X_2 \mid X_3, \dots, X_p)$ denotes the conditional mutual
information with respect to the distribution $q_1$. Equality is
achieved when $q_2 = q_2^*$, where

$$q_2^*(x_1, x_2, \dots, x_p) := q_1(x_1 \mid x_3, \dots, x_p) \cdot q_1(x_2 \mid x_3, \dots, x_p) \cdot q_1(x_3, \dots, x_p).$$

Remark: Note that we do not impose any distributional
assumptions on either $q_1$ or $q_2$. Furthermore, if the edge $(1,2)$
is also absent in the graphical model representation of $q_1$, we
have $I(X_1; X_2 \mid X_3, \dots, X_p) = 0$. Consequently, equality is
achieved in equation (1) with both sides equal to 0.
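Since Theorem 1 makes no distributional assumptions, its content can be illustrated on a small discrete distribution. The sketch below uses a hypothetical probability mass function over three binary variables (with sums in place of integrals, which is an assumption of this illustration), builds $q_2^*$ from the conditionals of $q_1$, and compares the resulting KL divergence with the conditional mutual information. A uniform distribution serves as a second candidate for which $X_1$ and $X_2$ are conditionally independent given $X_3$.

```python
import math

# Hypothetical joint pmf q1 over (x1, x2, x3) in {0, 1}^3 (unnormalized weights).
weights = {(0, 0, 0): 4, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
           (1, 0, 0): 1, (1, 0, 1): 2, (1, 1, 0): 3, (1, 1, 1): 2}
total = sum(weights.values())
q1 = {x: w / total for x, w in weights.items()}


def marginal(q, keep):
    """Marginal pmf of the coordinates listed in `keep`."""
    out = {}
    for x, pr in q.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out


def kl(qa, qb):
    """Discrete KL divergence KL(qa || qb)."""
    return sum(pa * math.log(pa / qb[x]) for x, pa in qa.items() if pa > 0)


q13, q23, q3 = marginal(q1, (0, 2)), marginal(q1, (1, 2)), marginal(q1, (2,))

# q2*(x) = q1(x1 | x3) q1(x2 | x3) q1(x3) = q1(x1, x3) q1(x2, x3) / q1(x3)
q2_star = {x: q13[(x[0], x[2])] * q23[(x[1], x[2])] / q3[(x[2],)] for x in q1}

# Conditional mutual information I(X1; X2 | X3) under q1.
cmi = sum(pr * math.log(pr * q3[(x[2],)] / (q13[(x[0], x[2])] * q23[(x[1], x[2])]))
          for x, pr in q1.items())

# Another candidate with X1 independent of X2 given X3: the uniform pmf.
q2_uniform = {x: 1.0 / 8.0 for x in q1}

print(cmi, kl(q1, q2_star), kl(q1, q2_uniform))
```

In this example the divergence to `q2_star` coincides with the conditional mutual information, while the uniform candidate is strictly farther from `q1`, in line with inequality (1).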

When the variables are jointly Gaussian, it is
possible to express the conditional mutual information
$I(X_1; X_2 \mid X_3, \dots, X_p)$ cleanly in terms of the inverse
covariance matrix of $q_1$. Our next result accordingly lower-bounds
the KL divergence between two multivariate Gaussian
distributions $q_{\Theta^*}$ and $q_\Theta$ in terms of the quantity

$$c_{\Theta^*} := \min_{(i,j):\, \Theta^*(i,j) \neq 0} \left( \frac{\Theta^*(i,i)\, \Theta^*(j,j)}{\Theta^*(i,i)\, \Theta^*(j,j) - \Theta^*(i,j)^2} \right).$$

Note that when $\Theta^* \succ 0$, each $2 \times 2$ submatrix of $\Theta^*$ over the
indices $i$ and $j$ is also positive definite, so
$\Theta^*(i,i)\Theta^*(j,j) - \Theta^*(i,j)^2 > 0$. Hence, $c_{\Theta^*} > 1$, and the quantity appearing in
the lower bound of Theorem 2 is strictly positive.
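The constant $c_{\Theta^*}$ is straightforward to compute from the entries of the inverse covariance matrix. The following minimal sketch takes the minimum over pairs $i < j$ with a nonzero entry, matching the ordered support convention; the function name `c_theta` and the example matrix are hypothetical.

```python
import math


def c_theta(theta):
    """Minimum over edges (i, j) of T(i,i) T(j,j) / (T(i,i) T(j,j) - T(i,j)^2)."""
    p = len(theta)
    ratios = [theta[i][i] * theta[j][j]
              / (theta[i][i] * theta[j][j] - theta[i][j] ** 2)
              for i in range(p) for j in range(i + 1, p)
              if theta[i][j] != 0]
    if not ratios:
        raise ValueError("the graph has no edges, so the minimum is over an empty set")
    return min(ratios)


# Hypothetical positive definite matrix with edges (1, 2) and (2, 3).
theta_star = [[2.0, 1.0, 0.0],
              [1.0, 2.0, -0.5],
              [0.0, -0.5, 1.0]]

c = c_theta(theta_star)
kl_lower_bound = 0.5 * math.log(c)   # the quantity (1/2) log c appearing in Theorem 2
print(c, kl_lower_bound)
```

For this matrix the edge $(1,2)$ contributes the ratio $4/3$ and the edge $(2,3)$ contributes $8/7$, so the weaker edge $(2,3)$ determines the constant.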

Theorem 2. Consider a fixed $\Theta^* \succ 0$, and let $\Theta \succ 0$ be such
that $\mathrm{supp}(\Theta^*) \setminus \mathrm{supp}(\Theta) \neq \emptyset$. Then

$$KL(q_{\Theta^*} \| q_\Theta) \ge \frac{1}{2} \log(c_{\Theta^*}).$$

The proof of Theorem 2 stems from the explicit relationship
between the entries of $\Theta^*$ and the conditional correlations
between corresponding pairs of variables. Note that the condition
$\mathrm{supp}(\Theta^*) \setminus \mathrm{supp}(\Theta) \neq \emptyset$ is necessary for the validity
of the theorem; we could otherwise take $\Theta = \Theta^*$ to make the
KL divergence equal to zero.
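In the smallest case $p = 2$, the bound of Theorem 2 can be checked by hand: removing the only edge forces $\Theta = \mathrm{diag}(u, v)$, and the KL divergence has a closed form in $u$ and $v$. The sketch below uses hypothetical values $a$, $b$, $t$ for the entries of $\Theta^*$ and scans a grid of diagonal candidates; the grid range and spacing are arbitrary choices made for illustration.

```python
import math

a, b, t = 2.0, 1.5, 0.8                 # hypothetical Theta* = [[a, t], [t, b]]
det_star = a * b - t * t                # positive, since Theta* is positive definite
bound = 0.5 * math.log(a * b / det_star)    # (1/2) log c_{Theta*}


def kl_to_diagonal(u, v):
    """KL(q_{Theta*} || q_Theta) for Theta = diag(u, v), i.e. edge (1, 2) removed."""
    # Sigma* = Theta*^{-1} has diagonal entries b / det_star and a / det_star.
    trace = (u * b + v * a) / det_star
    return 0.5 * (trace - 2.0 + math.log(det_star) - math.log(u * v))


grid = [0.05 * k for k in range(1, 101)]    # u, v ranging over (0, 5]
best_on_grid = min(kl_to_diagonal(u, v) for u in grid for v in grid)

# Setting the derivative to zero gives u = det_star / b and v = det_star / a.
at_minimizer = kl_to_diagonal(det_star / b, det_star / a)

print(bound, best_on_grid, at_minimizer)
```

In this two-dimensional case the bound is attained exactly at the minimizer, and no diagonal candidate on the grid falls below it.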

Remark: From the point of view of graphical model estimation,
Theorem 2 provides a strong cautionary message that if
at least one edge in the true graph with edge set $\mathrm{supp}(\Theta^*)$ is
missing, the KL divergence between $q_{\Theta^*}$ and the best possible
fit is lower-bounded by the constant $\frac{1}{2} \log(c_{\Theta^*}) > 0$. Indeed,
Theorem 2 guarantees that if $G^* = G(\Theta^*)$ is the true graphical
model and $G$ is any other graph with $E(G^*) \setminus E(G) \neq \emptyset$, then

$$\min_{\Theta \succ 0:\ \mathrm{supp}(\Theta) \subseteq E(G)} KL(q_{\Theta^*} \| q_\Theta) \ge \frac{1}{2} \log(c_{\Theta^*}).$$

Theorem 2 is an important partner result to the theoretical
conclusions of Zhou et al. [14], where an upper bound is
provided on the Hamming distance between the edge sets of
the true graphical model and the graphical model estimated
by their algorithm. Our theorem states that whenever the
Hamming distance between the edge sets is at least one, the
KL divergence between the true distribution and the closest
distribution in the estimated class is already bounded below
by a constant. This emphasizes the importance of selecting
the true edge set (or a superset thereof) when estimating the
structure of the graphical model.


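Specializing Theorem 2 to the class of matrices $\Omega_\infty(\alpha, h)$,
we have the following simple corollary:

Corollary 1. Suppose $\Theta^* \in \Omega_\infty(\alpha, h)$. If $\Theta \succ 0$ is such that
$\mathrm{supp}(\Theta^*) \setminus \mathrm{supp}(\Theta) \neq \emptyset$, then

$$KL(q_{\Theta^*} \| q_\Theta) \ge \frac{1}{2} \log\left( \frac{h^2}{h^2 - \alpha^2} \right).$$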

Remark: Note that Corollary 1 only requires the true inverse
covariance matrix $\Theta^*$ to lie in $\Omega_\infty(\alpha, h)$, whereas $\Theta$ may be
inside or outside the class. The conclusion of the corollary
suggests that the ratio $\alpha/h$ may be interpreted as a type of
(normalized) minimum signal strength for the true distribution
$q_{\Theta^*}$. Indeed, as $\alpha/h \to 1$, the KL divergence between the true
distribution and all alternative distributions with the incorrect
graphical structure grows unboundedly. Since the lower bound
on the KL divergence increases in $\alpha$ for a fixed value of $h$,
Corollary 1 further corroborates the notion that a type of
"strong faithfulness" condition on the true inverse covariance
matrix makes the problem of edge estimation more tractable
for Gaussian graphical models [20]. However, whereas the
idea of strong faithfulness was previously introduced in
order to quantify the success of specific statistical estimation
algorithms, Corollary 1 establishes that a separation between
the zero and non-zero values of $\Theta^*$ actually measures the
intrinsic hardness of the graphical model selection problem in
an information-theoretic sense.
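The dependence of the bound on the ratio $\alpha/h$ can be tabulated with a short sketch. It assumes the bound takes the form $\frac{1}{2}\log\left(h^2/(h^2 - \alpha^2)\right)$, which is what results from inserting the extreme entries permitted by the class (diagonal entries equal to $h$, off-diagonal magnitude equal to $\alpha$) into the definition of $c_{\Theta^*}$; the function name is hypothetical.

```python
import math


def corollary1_bound(alpha, h):
    """(1/2) log(h^2 / (h^2 - alpha^2)), assuming 0 < alpha < h."""
    if not 0 < alpha < h:
        raise ValueError("expected 0 < alpha < h")
    return 0.5 * math.log(h * h / (h * h - alpha * alpha))


h = 2.0
alphas = [0.5, 1.0, 1.5, 1.9, 1.99]
bounds = [corollary1_bound(alpha, h) for alpha in alphas]
for alpha, value in zip(alphas, bounds):
    print(f"alpha/h = {alpha / h:.3f}  ->  bound = {value:.4f}")
```

The printed values increase with $\alpha$ for fixed $h$ and diverge as $\alpha/h$ approaches 1, matching the interpretation of $\alpha/h$ as a normalized minimum signal strength.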


Note also that the dependence of $c_{\Theta^*}$ on the conditional
correlation terms appearing as entries of $\Theta^*$ takes into account
the behavior of other nodes in the graph, as well.


Our results on KL separation also have useful consequences
regarding the sample complexity of a likelihood-based model
selection procedure. Suppose the true inverse covariance matrix
satisfies $\Theta^* \in \Omega_F(\gamma)$. Further suppose that
we have a set of candidate graphs $\mathcal{G} = \{G_0, G_1, \dots, G_M\}$,
with $E(G_0) = \mathrm{supp}(\Theta^*)$ and $E(G_0) \setminus E(G_m) \neq \emptyset$, for all
$1 \le m \le M$. In other words, $G_0$ is the graphical model of the
true distribution and each of the alternative graphs is missing
at least one edge.

We will analyze a maximum likelihood approach, which is
equivalent to minimizing the KL divergence between the true
model and another distribution in the parametric class [21].
Let

$$\ell_n(\Theta) := -\log\det(\Theta) + \mathrm{tr}(\hat{\Sigma} \Theta)$$

denote the negative log likelihood with respect to a distribution
$q_\Theta$, where $\hat{\Sigma}$ is the empirical covariance matrix, and let

$$\ell(\Theta) := E_{\Theta^*}[\ell_n(\Theta)]$$

denote the expected value of $\ell_n(\Theta)$ with respect to $q_{\Theta^*}$. Also
define the scores

$$S(G_m) := \min_{\substack{\Theta \in \Omega_F(\gamma): \\ \mathrm{supp}(\Theta) \subseteq E(G_m)}} \{\ell_n(\Theta)\}, \qquad \forall\, 0 \le m \le M,$$

where the minimum is taken over all inverse covariance
matrices with Frobenius norm bounded by $\gamma$ that are consistent
with the edge structure of $G_m$. We discuss the Frobenius
norm bound in the remarks following Corollary 2. Note that
computing the score of a given graph is a tractable convex
optimization program, since both the objective function
and constraint set $\Omega_F(\gamma)$ are easily seen to be convex. We
define the graph estimator $\hat{G} = \arg\min_{G_m} \{S(G_m)\}$ to be
the minimum-scoring graph in the collection, where we are
agnostic to the choice of graph if more than one minimizer
exists. We then have the following result:
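The link between the scores and the KL separation can be seen in a two-node illustration. The sketch below evaluates the objective $-\log\det(\Theta) + \mathrm{tr}(\Sigma\Theta)$ with the true covariance matrix substituted for the empirical one (a population-level simplification assumed here), and it assumes that $\gamma$ is large enough for the Frobenius constraint to be inactive. The matrices are hypothetical.

```python
import math


def neg_log_lik(theta, sigma):
    """-log det(theta) + tr(sigma theta) for symmetric 2 x 2 matrices."""
    det = theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0]
    trace = sum(sigma[i][j] * theta[j][i] for i in range(2) for j in range(2))
    return -math.log(det) + trace


# Hypothetical true model with a single edge; sigma_star is its exact inverse.
theta_star = [[2.0, 1.0], [1.0, 2.0]]
sigma_star = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]

# Population score of the true graph, attained at Theta = Theta*.
score_true = neg_log_lik(theta_star, sigma_star)

# Population score of the empty graph: over Theta = diag(u, v), the objective
# -log(u v) + u Sigma*(1,1) + v Sigma*(2,2) is minimized at u = v = 3/2.
theta_diag = [[1.5, 0.0], [0.0, 1.5]]
score_missing = neg_log_lik(theta_diag, sigma_star)

gap = score_missing - score_true
c = 2.0 * 2.0 / (2.0 * 2.0 - 1.0 ** 2)   # c_{Theta*} for the single edge
print(gap, math.log(c))
```

Here the population score gap equals $\log(c_{\Theta^*})$, which is twice the KL lower bound of Theorem 2, so the graph missing the edge is penalized by a constant that does not vanish with the sample size.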


Corollary 2. Suppose the data are drawn from a multivariate
normal distribution with covariance matrix $\Sigma^*$,
and suppose a set of candidate graphs $\mathcal{G}$ is given, where
$|\mathrm{supp}(G_m)| \le s + p$ for all $m \ge 0$. Suppose the sample
size satisfies

$$n \ge \frac{C \gamma^2}{c_{\Theta^*}^2} \, \lambda_{\max}^2(\Sigma^*) \, (p + s) \log p.$$

Then with probability at least $1 - c \exp(-c' \log p)$, we have $\hat{G} = G(\Theta^*)$.


We may observe easily from the proof of Corollary 1
that equality is achieved in the KL bound when a single
$2 \times 2$ submatrix of $\Theta^*$ corresponding to indices $i \neq j$
has diagonal entries equal to $h$ and off-diagonals equal to
$\pm\alpha$, since the parameter $c_{\Theta^*}$ is computed as a minimum
over all $2 \times 2$ submatrices. However, as explored in more
detail in Section IV, the separation in KL divergence does
not necessarily scale with the size of the edge discrepancy
between $G(\Theta^*)$ and $G(\Theta)$. In the results of that section, we
provide examples where an increase in the Hamming distance
between the two graphs does not substantively affect the
minimum KL divergence between $q_{\Theta^*}$ and the best alternative
model. This emphasizes the fact that $c_{\Theta^*}$ is not purely a local
(edgewise) property.

Remark: It is helpful to compare Corollary 2 with the required
sample size for related results on Gaussian graphical model
selection. We first compare our result to the graphical model
selection guarantees of Ravikumar et al. [9]. Although the
sample size scaling $n \gtrsim (p + s) \log p$ required by our corollary
is somewhat stronger than the $n \gtrsim d^2 \log p$ requirement of
Ravikumar et al. [9], where $d$ denotes the degree of the graph
$G_0$, we do not impose any of the irrepresentable conditions that
are rather restrictive and somewhat uninterpretable. Similarly,
nodewise regression methods [6] are consistent for model
selection under the milder sample size scaling $n \gtrsim d \log p$,
but under more stringent incoherence assumptions. Note that
in our result, the constant $c_{\Theta^*}$ takes the role of a beta-min
condition, assumed by previous authors in order to derive
model selection consistency.


We now discuss the parameter $\gamma$ that bounds the Frobenius
norm of inverse covariance matrices in our model class. This
additional parameter is somewhat undesirable if we expect the
Frobenius norm to scale with $p$ (e.g., for jointly independent
random variables), since it creates an even larger factor in
the sample size requirement; however, it is the same assumption
imposed for the purpose of Gaussian graphical model
estimation in Zhou et al. [14]. Some matrix norm bound on
the class of inverse covariance matrices under consideration is
certainly necessary, although we are unsure whether one can
do better. Furthermore, it is hard to compare our Frobenius
norm assumption directly with the $\ell_\infty$-operator norm bounds
on population-level matrices appearing in the analyses of alternative
methods [6], [9]. We note the useful observation from
Zhou et al. [14] that if the diagonal entries of $\Sigma^*$ are known
a priori, we may replace the estimate $\hat{\Sigma}$ of the covariance
matrix $\Sigma^*$ with the matrix $\tilde{\Sigma}$ in the definition of $\ell_n(\Theta)$,
where $\tilde{\Sigma}$ has the correct diagonal entries. Then a sharper
analysis leads to the slightly milder sample size requirement

$$n \ge \frac{C \gamma^2}{c_{\Theta^*}^2} \, \lambda_{\max}^2(\Sigma^*) \, s \log p.$$

However, the assumption that
the diagonal entries are known exactly may be too strong in
practical applications.


IV. Extensions and counterexamples 

Theorem 2 shows that if the estimated graph is missing
at least one edge, the KL divergence between the true and
estimated distributions is bounded below by a constant. In
general, we may study the problem of evaluating a lower
bound $L(d)$ for the case of $d \ge 1$ missing edges. The value of
$L(1)$ is given by Theorem 2 and Corollary 1. It is reasonable
to conjecture that $L(d)$ scales with $d$; such a scaling would
make it possible to relate the Hamming distance between
two graphs to the KL divergence between pairs of probability
distributions supported on the respective graphs. In this section,
however, we show that $L(d)$ does not scale in a meaningful
way with $d$. We present an explicit family of graphs for which
$L(1) \le L(d) \le C$, for some constant $C$ that is independent
of $d$. This shows that the constant bound from Theorem 2 is
essentially tight.

We begin with the statement of Theorem 3, which generalizes
the result from Theorem 1.

Theorem 3. Let $X = (X_1, \dots, X_p)$ be as in Theorem 1.
Let $\Theta_1$ be the inverse covariance matrix of $q_1$, and
let $G_1 = (V, E_1)$ be the corresponding graph. Without
loss of generality, consider the vertex 1 and $d \ge 1$ of
its neighbors $\{2, 3, \dots, d+1\}$. Let $G = (V, E)$, where
$E := \{(1,2), (1,3), \dots, (1,d+1)\}^c$ is the set of all edges other
than those joining vertex 1 to $\{2, \dots, d+1\}$. Let $q_2$ be any distribution
with the corresponding graphical model $G_2 = (V, E_2)$,
such that $E_2 \subseteq E$. The following inequality holds:

$$KL(q_1 \| q_2) \ge I(X_1; X_2, \dots, X_{d+1} \mid X_{d+2}, \dots, X_p). \qquad (2)$$

Equality is achieved when $q_2 = q_2^*$, where $q_2^*$ is defined by

$$q_2^*(x_1, \dots, x_p) = q_1(x_1 \mid x_{d+2}, \dots, x_p) \cdot q_1(x_2, \dots, x_{d+1} \mid x_{d+2}, \dots, x_p) \cdot q_1(x_{d+2}, \dots, x_p).$$

An illustration of Theorem 3 is provided in Figure 1. Note that
analogous to the statement of Theorem 1, Theorem 3 does not
impose any distributional assumptions on $q_1$ or $q_2$.

Remark: In both Theorems 1 and 3, the candidate distributions
$q_2$ are identified via the support of $\Theta_2$, and the particular
structure of $\mathrm{supp}(\Theta_2)$ allows us to express $q_2$ in a convenient
product form. Such a property does not hold for an arbitrary
choice of $\mathrm{supp}(\Theta_2)$, although it holds for the support
structures considered in Theorems 1 and 3. In fact, we may
generalize the statement of Theorems 1 and 3 to include any
graphical structure where there exists a directed acyclic graph
reflecting all conditional independence relationships present
in $\mathrm{supp}(\Theta_2)$.



Fig. 1. An illustration of Theorem 3. Panel (a) shows the graph $G_1 = G(\Theta_1)$,
where the neighbors of node 1 include the nodes $\{2, \dots, d+1\}$. Note that
node 1 may also have other neighbors, and the remaining nodes may be connected
arbitrarily. Panel (b) shows a graph $G_2$ having the property that edges
$\{(1,2), \dots, (1,d+1)\}$ are missing. Again, we do not impose any restrictions
on the presence or absence of other edges in the graph. Theorem 3 implies
that the KL divergence between $q_1$ and any distribution $q_2$ with graphical
model $G_2$ is bounded below by $I(X_1; X_2, \dots, X_{d+1} \mid X_{d+2}, \dots, X_p)$.


Using Theorem 3, we provide an example showing that the
lower bound $L(d)$ on the KL divergence for pairs of graphs
differing by $d$ edges may be bounded above by a constant:

Example. Let $d \ge 1$. Pick a $(d+1)$-dimensional Gaussian
random variable $X = (X_1, X_2, \dots, X_{d+1})$ with a distribution
$q_1$, as follows: The random variables $(X_1, \dots, X_d, W)$
are independent standard normal random variables, and
$X_{d+1} = \sum_{i=1}^{d} X_i + W$. Let the inverse covariance matrix of
$q_1$ be $\Theta_1$, and let $G_1 = (V, E_1)$ denote the corresponding
graph. We may check that

$$\Theta_1 = \begin{bmatrix}
2 & 1 & \cdots & 1 & -1 \\
1 & 2 & \cdots & 1 & -1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 & 1 & \cdots & 2 & -1 \\
-1 & -1 & \cdots & -1 & 1
\end{bmatrix},$$

and consequently,

$$\{(1,2), \dots, (1,d+1)\} \subseteq E_1.$$

Now choose a distribution $q_2^*$ as per Theorem 3, so

$$q_2^*(x_1, \dots, x_{d+1}) = q_1(x_1) \cdot q_1(x_2, \dots, x_{d+1}).$$

Note that the graph $G_2$ of $q_2^*$ does not have the edges
$\{(1,2), \dots, (1,d+1)\}$, so it differs from $G_1$ by at least $d$
edges. By the result of Theorem 3, the KL divergence between
$q_1$ and $q_2^*$ is given by

$$\begin{aligned}
KL(q_1 \| q_2^*) &= I(X_1; X_2, \dots, X_{d+1}) \\
&= I(X_1; X_2, \dots, X_d) + I(X_1; X_{d+1} \mid X_2, \dots, X_d) \\
&\overset{(a)}{=} I(X_1; X_{d+1} \mid X_2, \dots, X_d) \\
&\overset{(b)}{=} \frac{1}{2} \log 2,
\end{aligned}$$

where in $(a)$, we use the fact that $X_1 \perp\!\!\!\perp (X_2, \dots, X_d)$ by
construction, and in $(b)$, we use Lemma 1. Our example shows
that $L(d) \le \frac{1}{2} \log 2$ for all $d \ge 1$. Note that the lower bound
appearing in Theorem 2 is equal to $\frac{1}{2} \log\left(\frac{4}{3}\right)$ in this case and
is achieved, e.g., when only edge $(1,2)$ is removed.
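The quantities in this example can be verified numerically. The sketch below assumes the construction $X_{d+1} = X_1 + \dots + X_d + W$, which is the one consistent with the edge set $\{(1,2), \dots, (1,d+1)\} \subseteq E_1$ and with the value $\frac{1}{2}\log\left(\frac{4}{3}\right)$ quoted above. It builds the covariance matrix, checks that the stated matrix is its inverse, and evaluates the two constants. The function names and the choice $d = 4$ are illustrative.

```python
import math


def example_matrices(d):
    """Covariance and precision matrices of (X_1, ..., X_{d+1}) in the example."""
    p = d + 1
    sigma = [[0.0] * p for _ in range(p)]
    theta = [[0.0] * p for _ in range(p)]
    for i in range(d):
        sigma[i][i] = 1.0                        # Var(X_i) = 1
        sigma[i][d] = sigma[d][i] = 1.0          # Cov(X_i, X_{d+1}) = 1
        theta[i][d] = theta[d][i] = -1.0
        for j in range(d):
            theta[i][j] = 2.0 if i == j else 1.0
    sigma[d][d] = d + 1.0                        # Var(X_{d+1}) = d + 1
    theta[d][d] = 1.0
    return sigma, theta


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


d = 4
sigma, theta = example_matrices(d)
product = matmul(sigma, theta)
is_inverse = all(abs(product[i][j] - (1.0 if i == j else 0.0)) < 1e-12
                 for i in range(d + 1) for j in range(d + 1))

# Lemma 1 applied to the pair (1, d+1): I(X_1; X_{d+1} | X_2, ..., X_d).
num = theta[0][0] * theta[d][d]
cmi = 0.5 * math.log(num / (num - theta[0][d] ** 2))

# c_{Theta_1}: minimum ratio over all edges i < j.
c = min(theta[i][i] * theta[j][j] / (theta[i][i] * theta[j][j] - theta[i][j] ** 2)
        for i in range(d + 1) for j in range(i + 1, d + 1) if theta[i][j] != 0)

print(is_inverse, cmi, 0.5 * math.log(c))
```

For any $d \ge 2$ the two printed constants are $\frac{1}{2}\log 2$ and $\frac{1}{2}\log\left(\frac{4}{3}\right)$, independent of $d$, which is the sense in which the separation does not grow with the number of missing edges.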

V. Discussion 

We have characterized the KL divergence between multivariate
Gaussian distributions with edge discrepancies in the
corresponding graphical models. Our constant-valued lower 
bound on the KL divergence between distributions when the 
graphs differ by even a single edge has both positive and 
negative implications: On the positive side, it provides upper 
bounds on the required sample complexity of model selection 
when presented with a collection of sparse candidate graphs 
containing the truth; on the negative side, our result implies 
that the fitted distribution will always be separated from the 
true distribution by a constant in terms of KL divergence when 
the edges are misspecified. This emphasizes the importance 
of selecting the correct graph when model selection and 
parameter estimation are performed sequentially. 

Future research directions include the following: Due to 
the parallel results for Gaussian and Ising graphical models 
appearing in the literature, it would be interesting to use 
Theorem 1 to derive lower bounds on the KL divergence 
between Ising distributions with different edge structures in 
terms of the parameters of the underlying distribution. We 
conjecture that for Ising models, the KL divergence will also 
be bounded below by a constant when the graphical models 
differ by at least one edge, although the analysis may be more 
complicated. Furthermore, it would be interesting to derive 
upper bounds on the KL divergence between models in both 
the Gaussian and Ising cases, which would lead to lower 
bounds on the sample complexity necessary for accurate edge 
recovery. A smattering of such results appears in the literature, 
but the picture seems far from complete. On a more ambitious 
note, it would be interesting to rigorize the tradeoff between 
sample and computational complexity for parameter estimation 
in graphical models, since one could always use fewer samples 
to obtain a larger superstructure of the true edge structure, at 
the expense of a higher computational complexity in fitting the 
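parameters to a larger set of estimated edges.

Appendix A
Proof of Theorem 1

Let $q_2$ be the density of a distribution on $X$ for which
$X_1 \perp\!\!\!\perp X_2 \mid X_{\{1,2\}^c}$, and let $\bar{q}_1$ and $\bar{q}_2$ denote the marginal
distributions on $(X_3, \dots, X_p)$ with respect to the distributions
$q_1$ and $q_2$, respectively. We have

$$\begin{aligned}
KL(q_1 \| q_2) &= \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x)}{q_2(x)} \, dx \\
&= \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_1, x_2 \mid x_{\{1,2\}^c}) \, \bar{q}_1(x_{\{1,2\}^c})}{q_2(x_1 \mid x_{\{1,2\}^c}) \, q_2(x_2 \mid x_{\{1,2\}^c}) \, \bar{q}_2(x_{\{1,2\}^c})} \, dx \\
&= KL(\bar{q}_1 \| \bar{q}_2) + \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_1, x_2 \mid x_{\{1,2\}^c})}{q_2(x_1 \mid x_{\{1,2\}^c}) \, q_2(x_2 \mid x_{\{1,2\}^c})} \, dx \\
&\ge \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_1, x_2 \mid x_{\{1,2\}^c})}{q_2(x_1 \mid x_{\{1,2\}^c}) \, q_2(x_2 \mid x_{\{1,2\}^c})} \, dx \\
&= \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_1, x_2 \mid x_{\{1,2\}^c})}{q_1(x_1 \mid x_{\{1,2\}^c}) \, q_1(x_2 \mid x_{\{1,2\}^c})} \, dx \\
&\qquad + \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_1 \mid x_{\{1,2\}^c}) \, q_1(x_2 \mid x_{\{1,2\}^c})}{q_2(x_1 \mid x_{\{1,2\}^c}) \, q_2(x_2 \mid x_{\{1,2\}^c})} \, dx \\
&= I(X_1; X_2 \mid X_3, \dots, X_p) + \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_1 \mid x_{\{1,2\}^c}) \, q_1(x_2 \mid x_{\{1,2\}^c})}{q_2(x_1 \mid x_{\{1,2\}^c}) \, q_2(x_2 \mid x_{\{1,2\}^c})} \, dx.
\end{aligned}$$

Note that the conditional mutual information is constant with
respect to $q_2$. We claim that the second term is always
nonnegative. Indeed, we may break up the term as

$$\begin{aligned}
&\int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_1 \mid x_{\{1,2\}^c})}{q_2(x_1 \mid x_{\{1,2\}^c})} \, dx + \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_2 \mid x_{\{1,2\}^c})}{q_2(x_2 \mid x_{\{1,2\}^c})} \, dx \\
&\quad = \int q_1(x_{\{2\}^c}) \log \frac{q_1(x_1 \mid x_{\{1,2\}^c})}{q_2(x_1 \mid x_{\{1,2\}^c})} \, dx_1 \, dx_3 \cdots dx_p + \int q_1(x_{\{1\}^c}) \log \frac{q_1(x_2 \mid x_{\{1,2\}^c})}{q_2(x_2 \mid x_{\{1,2\}^c})} \, dx_2 \, dx_3 \cdots dx_p \\
&\quad = \int \left( \int q_1(x_1 \mid x_{\{1,2\}^c}) \log \frac{q_1(x_1 \mid x_{\{1,2\}^c})}{q_2(x_1 \mid x_{\{1,2\}^c})} \, dx_1 \right) d\bar{q}_1(x_{\{1,2\}^c}) \\
&\qquad + \int \left( \int q_1(x_2 \mid x_{\{1,2\}^c}) \log \frac{q_1(x_2 \mid x_{\{1,2\}^c})}{q_2(x_2 \mid x_{\{1,2\}^c})} \, dx_2 \right) d\bar{q}_1(x_{\{1,2\}^c}).
\end{aligned}$$

Finally, note that the two inner integrals are expressions for
the KL divergence between the conditional distributions of
$X_1 \mid X_{\{1,2\}^c}$ and $X_2 \mid X_{\{1,2\}^c}$, when $X$ is distributed according
to $q_1$ and $q_2$, respectively. Hence, both integrals are nonnegative.
We conclude that inequality (1) holds.

In order for equality to be satisfied, note that we require
$KL(\bar{q}_1 \| \bar{q}_2) = 0$ and the conditional KL terms to be equal to
0 for each value of $(x_3, \dots, x_p)$, meaning

$$q_2(x_1 \mid x_3, \dots, x_p) = q_1(x_1 \mid x_3, \dots, x_p),$$

and

$$q_2(x_2 \mid x_3, \dots, x_p) = q_1(x_2 \mid x_3, \dots, x_p).$$

This uniquely defines the distribution $q_2^*$.

Appendix B
Proof of Theorem 2

We begin by proving the following lemma, which we derive
via a direct computation. A similar result may be found in
Proposition 17 of Anandkumar et al. [15], but we provide the
full details here for completeness.

Lemma 1. Let $X = (X_1, \dots, X_p)$ be drawn from a multivariate
normal distribution with inverse covariance matrix $\Theta$.
Then

$$I(X_1; X_2 \mid X_3, \dots, X_p) = \frac{1}{2} \log \left( \frac{\Theta(1,1)\, \Theta(2,2)}{\Theta(1,1)\, \Theta(2,2) - \Theta(1,2)^2} \right),$$

where the mutual information is computed with respect to $q_\Theta$.

Proof: We begin with some notation. Let $(X_1, X_2) = U$
and $(X_3, \dots, X_p) = V$. Let the covariances of $X$, $U$, and $V$
be denoted by $\Sigma_{XX}$, $\Sigma_{UU}$, and $\Sigma_{VV}$, respectively. The cross-covariance
of $U$ and $V$ is denoted by $\Sigma_{UV}$. Note that

$$\Sigma_{UV} = \begin{bmatrix} \Sigma_{X_1 V} \\ \Sigma_{X_2 V} \end{bmatrix},$$

where $\Sigma_{X_i V}$ stands for the cross-covariance matrix of $X_i$ and
$V$, for $i \in \{1, 2\}$. We have

$$\Sigma_{XX} = \begin{bmatrix} \Sigma_{UU} & \Sigma_{UV} \\ \Sigma_{UV}^T & \Sigma_{VV} \end{bmatrix}.$$

Since $(X_1, \dots, X_p)$ are jointly Gaussian, the mutual information
term may be computed as

$$\begin{aligned}
I(X_1; X_2 \mid X_{\{1,2\}^c}) &= H(X_1 \mid X_{\{1,2\}^c}) + H(X_2 \mid X_{\{1,2\}^c}) - H(X_1, X_2 \mid X_{\{1,2\}^c}) \\
&= H(X_1 \mid V) + H(X_2 \mid V) - H(U \mid V) \\
&= \frac{1}{2} \log \left| \Sigma_{UU}(1,1) - \Sigma_{X_1 V} \Sigma_{VV}^{-1} \Sigma_{X_1 V}^T \right| \\
&\qquad + \frac{1}{2} \log \left| \Sigma_{UU}(2,2) - \Sigma_{X_2 V} \Sigma_{VV}^{-1} \Sigma_{X_2 V}^T \right| \\
&\qquad - \frac{1}{2} \log \left| \Sigma_{UU} - \Sigma_{UV} \Sigma_{VV}^{-1} \Sigma_{UV}^T \right|.
\end{aligned}$$

Given a block matrix

$$M = \begin{bmatrix} A & B \\ C & D \end{bmatrix},$$

the Schur complement of $D$ is given by $A - B D^{-1} C$, and
$|A - B D^{-1} C| = \frac{|M|}{|D|}$.

Note that $\Sigma_{UU}(1,1) - \Sigma_{X_1 V} \Sigma_{VV}^{-1} \Sigma_{X_1 V}^T$ is the Schur
complement of $\Sigma_{VV}$ in the matrix $\Sigma_{XX}^{(2)}$, which is $\Sigma_{XX}$
with the second row and second column removed. Similarly,
$\Sigma_{UU}(2,2) - \Sigma_{X_2 V} \Sigma_{VV}^{-1} \Sigma_{X_2 V}^T$ is the Schur complement of
$\Sigma_{VV}$ in $\Sigma_{XX}^{(1)}$, which is $\Sigma_{XX}$ with the first row and first
column removed. The final term $\Sigma_{UU} - \Sigma_{UV} \Sigma_{VV}^{-1} \Sigma_{UV}^T$ is
the Schur complement of $\Sigma_{VV}$ in $\Sigma_{XX}$; by the block matrix
inversion formula, it equals

$$\begin{bmatrix} \Theta(1,1) & \Theta(1,2) \\ \Theta(2,1) & \Theta(2,2) \end{bmatrix}^{-1}.$$

Thus, we obtain

$$I(X_1; X_2 \mid X_3, \dots, X_p) = \frac{1}{2} \log \left( \frac{|\Sigma_{XX}^{(2)}|}{|\Sigma_{VV}|} \right) + \frac{1}{2} \log \left( \frac{|\Sigma_{XX}^{(1)}|}{|\Sigma_{VV}|} \right) + \frac{1}{2} \log \left( \Theta(1,1)\Theta(2,2) - \Theta(1,2)^2 \right).$$

By the cofactor formula for the matrix inverse, $\Theta(1,1) = |\Sigma_{XX}^{(1)}| / |\Sigma_{XX}|$
and $\Theta(2,2) = |\Sigma_{XX}^{(2)}| / |\Sigma_{XX}|$, and furthermore
$|\Sigma_{XX}| / |\Sigma_{VV}| = \left( \Theta(1,1)\Theta(2,2) - \Theta(1,2)^2 \right)^{-1}$. Substituting
these identities into the display above yields the claimed expression.

Now we are ready to derive the main result. Note that
by assumption, there exists $(i,j) \in \mathrm{supp}(\Theta^*) \setminus \mathrm{supp}(\Theta)$ with
$i \neq j$. Hence, $X_i \perp\!\!\!\perp_\Theta X_j \mid X_{\{i,j\}^c}$, where $\perp\!\!\!\perp_\Theta$ denotes
conditional independence with respect to the distribution $q_\Theta$. Then

$$KL(q_{\Theta^*} \| q_\Theta) \ge I(X_i; X_j \mid X_{\{i,j\}^c}) \ge \frac{1}{2} \log(c_{\Theta^*}),$$

where the first inequality follows by Theorem 1 and the
second inequality by Lemma 1, and the mutual information is
computed with respect to $q_{\Theta^*}$. This is the desired inequality.

Appendix C
Proof of Corollary 1

Note that since $\Theta \succ 0$ for all $\Theta \in \Omega_\infty(\alpha, h)$, we have
$\alpha < h$. Hence, for every $(i,j)$ with $i \neq j$ and $\Theta^*(i,j) \neq 0$,

$$\Theta^*(i,i)\, \Theta^*(j,j) \le h^2 \quad \text{and} \quad \Theta^*(i,j)^2 \ge \alpha^2,$$

and

$$c_{\Theta^*} = \min_{(i,j):\, \Theta^*(i,j) \neq 0} \left( 1 - \frac{\Theta^*(i,j)^2}{\Theta^*(i,i)\, \Theta^*(j,j)} \right)^{-1} \ge \frac{h^2}{h^2 - \alpha^2}.$$

The result is then an immediate consequence of Theorem 2.

Appendix D
Proof of Corollary 2

Note that

$$S(G_0) \le \ell_n(\Theta^*) = \ell(\Theta^*) + \mathrm{tr}\left( (\hat{\Sigma} - \Sigma^*) \Theta^* \right),$$

and for $m \ge 1$, we have

$$S(G_m) \ge \min_{\substack{\Theta \in \Omega_F(\gamma): \\ \mathrm{supp}(\Theta) \subseteq E(G_m)}} \{\ell(\Theta)\} - \max_{\substack{\Theta \in \Omega_F(\gamma): \\ \mathrm{supp}(\Theta) \subseteq E(G_m)}} \left| \mathrm{tr}\left( (\hat{\Sigma} - \Sigma^*) \Theta \right) \right|,$$

where $\Sigma^* := \Theta^{*-1}$. Furthermore, for all $\Theta \in \Omega_F(\gamma)$ satisfying
$\mathrm{supp}(\Theta) \subseteq E(G_m)$ for some $m \ge 0$, we have

$$\left| \mathrm{tr}\left( (\hat{\Sigma} - \Sigma^*) \Theta \right) \right| \le |||\Theta|||_F \cdot |||(\hat{\Sigma} - \Sigma^*)_{S_m}|||_F \le \gamma \, |||(\hat{\Sigma} - \Sigma^*)_{S_m}|||_F$$

by the Cauchy-Schwarz inequality, where $(\cdot)_{S_m}$ denotes the restriction
of a matrix to the diagonal entries and the entries indexed by $E(G_m)$. Also,

$$\max_{0 \le m \le M} |||(\hat{\Sigma} - \Sigma^*)_{S_m}|||_F \le C' \lambda_{\max}(\Sigma^*) \sqrt{\frac{(p+s) \log p}{n}}$$

with probability at least $1 - c \exp(-c' \log p)$, using standard
Gaussian tail bounds [22]. It follows that

$$S(G_0) \le \ell(\Theta^*) + C' \gamma \lambda_{\max}(\Sigma^*) \sqrt{\frac{(p+s) \log p}{n}}$$

and

$$S(G_m) \ge \min_{\substack{\Theta \in \Omega_F(\gamma): \\ \mathrm{supp}(\Theta) \subseteq E(G_m)}} \{\ell(\Theta)\} - C' \gamma \lambda_{\max}(\Sigma^*) \sqrt{\frac{(p+s) \log p}{n}}$$

for all $m \ge 1$, with the same probability. Finally, by Theorem 2,
we have

$$\min_{\substack{\Theta \in \Omega_F(\gamma): \\ \mathrm{supp}(\Theta) \subseteq E(G_m)}} \{\ell(\Theta)\} - \ell(\Theta^*) \ge c_{\Theta^*}, \qquad \forall\, 1 \le m \le M.$$

We conclude that if $2 C' \gamma \lambda_{\max}(\Sigma^*) \sqrt{\frac{(p+s) \log p}{n}} < c_{\Theta^*}$, then
$\hat{G} = G_0$. Rearranging yields the desired sample size requirement.

Appendix E
Proof of Theorem 3

Let $\bar{q}_1$ and $\bar{q}_2$ denote the marginal distributions on
$(x_{d+2}, \dots, x_p)$ with respect to the distributions $q_1$ and $q_2$,
respectively. For $i < j$, let $x_{i:j}$ denote the $(j - i + 1)$-dimensional
vector $(x_i, \dots, x_j)$. Since vertex 1 is not adjacent to any of the
vertices $\{2, \dots, d+1\}$ in $G_2$, we have
$X_1 \perp\!\!\!\perp X_{2:(d+1)} \mid X_{(d+2):p}$ under $q_2$. Then

$$\begin{aligned}
KL(q_1 \| q_2) &= \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_{1:(d+1)} \mid x_{(d+2):p}) \, \bar{q}_1(x_{(d+2):p})}{q_2(x_1 \mid x_{(d+2):p}) \, q_2(x_{2:(d+1)} \mid x_{(d+2):p}) \, \bar{q}_2(x_{(d+2):p})} \, dx \\
&= KL(\bar{q}_1 \| \bar{q}_2) + \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_{1:(d+1)} \mid x_{(d+2):p})}{q_2(x_1 \mid x_{(d+2):p}) \, q_2(x_{2:(d+1)} \mid x_{(d+2):p})} \, dx \\
&\ge I(X_1; X_{2:(d+1)} \mid X_{(d+2):p}) + \int_{\mathbb{R}^p} q_1(x) \log \frac{q_1(x_1 \mid x_{(d+2):p}) \, q_1(x_{2:(d+1)} \mid x_{(d+2):p})}{q_2(x_1 \mid x_{(d+2):p}) \, q_2(x_{2:(d+1)} \mid x_{(d+2):p})} \, dx.
\end{aligned}$$

We claim that the second term is always
nonnegative. Indeed, we may break up the term as

$$\begin{aligned}
&\int \left( \int q_1(x_1 \mid x_{(d+2):p}) \log \frac{q_1(x_1 \mid x_{(d+2):p})}{q_2(x_1 \mid x_{(d+2):p})} \, dx_1 \right) d\bar{q}_1(x_{(d+2):p}) \\
&\quad + \int \left( \int q_1(x_{2:(d+1)} \mid x_{(d+2):p}) \log \frac{q_1(x_{2:(d+1)} \mid x_{(d+2):p})}{q_2(x_{2:(d+1)} \mid x_{(d+2):p})} \, dx_{2:(d+1)} \right) d\bar{q}_1(x_{(d+2):p}).
\end{aligned}$$

Finally, note that the two inner integrals are expressions for
the KL divergence between the conditional distributions

$$X_1 \mid (X_{(d+2):p} = x_{(d+2):p})$$

and

$$X_{2:(d+1)} \mid (X_{(d+2):p} = x_{(d+2):p}),$$

when $X$ is distributed according to $q_1$ and $q_2$, respectively.
Hence, both integrals are nonnegative. We conclude that
inequality (2) holds.

Note that equality is achieved exactly when
$KL(\bar{q}_1 \| \bar{q}_2) = 0$ and the conditional KL terms are equal
to 0 for each value of $(x_{d+2}, \dots, x_p)$. Then

$$q_2(x_1 \mid x_{d+2}, \dots, x_p) = q_1(x_1 \mid x_{d+2}, \dots, x_p),$$

and

$$q_2(x_2, \dots, x_{d+1} \mid x_{d+2}, \dots, x_p) = q_1(x_2, \dots, x_{d+1} \mid x_{d+2}, \dots, x_p),$$

which uniquely determines the distribution $q_2^*$.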








